[SOUND]
>> This lecture is about an overview of Statistical Language Models, which cover probabilistic topic models as special cases. In this lecture we're going to give an overview of Statistical Language Models. These models are general models that cover probabilistic topic models as special cases.
So first off,
what is a Statistical Language Model?
A Statistical Language Model is
basically a probability distribution
over word sequences.
So, for example, we might have a distribution that gives "today is Wednesday" a probability of 0.001. It might give "today Wednesday is," which is a non-grammatical sentence, a very, very small probability, as shown here. And similarly, another sentence, "the eigenvalue is positive," might get a probability of 0.00001.
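Written down directly, such a distribution is just a mapping from word sequences to probabilities. Here is a minimal Python sketch; the exact tiny value for the non-grammatical sentence is my own placeholder:

```python
# A statistical language model: a probability distribution over
# word sequences (only three example sequences shown here).
language_model = {
    ("today", "is", "Wednesday"): 0.001,
    ("today", "Wednesday", "is"): 1e-13,  # non-grammatical: placeholder tiny value
    ("the", "eigenvalue", "is", "positive"): 0.00001,
}
```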
So as you can see, such a distribution is clearly context dependent; it depends on the context of discussion. Some word sequences might have higher probabilities than others, but the same sequence of words might have different probabilities in different contexts.
And so this suggests that such a distribution can actually characterize a topic. Such a model can also be regarded as a probabilistic mechanism for generating text, and that just means we can view text data as data observed from such a model. For this reason, we call such a model a generative model.
So now, given a model, we can sample sequences of words. For example, based on the distribution that I have shown here on this slide, we might sample a sequence like "today is Wednesday," because it has a relatively high probability; we might often get such a sequence. We might also sometimes get "the eigenvalue is positive," with a smaller probability. And very, very occasionally we might get "today Wednesday is," because its probability is so small.
So in general, in order to characterize such a distribution, we must specify probability values for all these different sequences of words.
Obviously, it's impossible
to specify that because it's
impossible to enumerate all of
the possible sequences of words.
So in practice, we will have to
simplify the model in some way.
So, the simplest language model is called the Unigram Language Model. In such a model, we simply assume that the text is generated by generating each word independently. In general, the words would not be generated independently, but after we make this assumption, we can significantly simplify the language model.
Basically, now the probability of
a sequence of words, w1 through wn,
will be just the product of
the probability of each word.
So for such a model, we have as many parameters as the number of words in our vocabulary. So if we assume the vocabulary has M words, then we have M probabilities, one for each word, and they sum to 1.
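In notation, the unigram assumption and the constraint on the parameters can be summarized as follows, where V denotes the vocabulary with |V| = M words:

```latex
p(w_1 w_2 \cdots w_n) = \prod_{i=1}^{n} p(w_i),
\qquad \sum_{w \in V} p(w) = 1 .
```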
So now we assume that our text is a sample drawn according to this word distribution. That just means we're going to draw a word each time, and eventually we'll get a text. So, for example, we can again try to sample words according to such a distribution: we might get "Wednesday" often, or "today" often, and some other words, like "eigenvalue," might have a small probability, etcetera.
But with this, we actually can also compute the probability of every sequence, even though our model only specifies the probabilities of individual words, and this is because of the independence assumption. So specifically, we can compute the probability of "today is Wednesday," because it's just the product of the probability of "today," the probability of "is," and the probability of "Wednesday." For example, I show some fake numbers here, and when you multiply these numbers together you get the probability of "today is Wednesday." So as you can see, with M probabilities, one for each word, we actually can characterize the probability distribution over all kinds of sequences of words.
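To make that computation concrete, here is a minimal Python sketch; the probability values are made-up stand-ins for the fake numbers on the slide:

```python
# Toy unigram model: just a few entries of a larger vocabulary (a full
# model would assign every word a probability, all summing to 1).
# The probability values here are made up for illustration.
unigram = {"today": 0.002, "is": 0.05, "Wednesday": 0.001}

def sequence_probability(words, model):
    """Unigram probability of a sequence: the product of the
    probabilities of the individual words (independence assumption)."""
    p = 1.0
    for w in words:
        p *= model[w]
    return p

print(sequence_probability(["today", "is", "Wednesday"], unigram))
# 0.002 * 0.05 * 0.001 = 1e-07
```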
So this is a very simple model: it ignores the word order. So it may not be sufficient for some problems, such as speech recognition, where you may care about the order of words. But it turns out to be quite sufficient for many tasks that involve topic analysis, and that's also what we're interested in here.
So when we have a model, we generally have two problems that we can think about. One is: given a model, how likely are we to observe certain kinds of data? That is, we are interested in the Sampling Process. The other is the Estimation Process, and that is to estimate the parameters of a model given some observed data; we're going to talk about that in a moment.
Let's first talk about the sampling.
So, here I show two examples of word distributions, or Unigram Language Models. The first one has higher probabilities for words like "text," "mining," "association," etcetera. Now, this signals a topic about text mining, because when we sample words from such a distribution, we tend to see words that often occur in text mining contexts.
So in this case, if we ask what the probability of generating a particular document is, then we likely will see text that looks like a text mining paper. Of course, the text that we generate by drawing words from this distribution is unlikely to be coherent, although the probability of generating a text mining paper published in a top conference is non-zero, assuming that no word has a zero probability in the distribution. And that just means we can essentially generate all kinds of text documents, including very meaningful text documents.
Now, the second distribution, shown on the bottom, has different words with high probabilities: "food" [INAUDIBLE], "healthy" [INAUDIBLE], etcetera. So this clearly indicates a different topic; in this case, it's probably about health. So if we sample words from such a distribution, then the probability of observing a text mining paper would be very, very small. On the other hand, the probability of observing a text that looks like a food nutrition paper would be relatively higher. So that just means different distributions will give us different text.
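Here is a minimal Python sketch of this sampling process; the two distributions, their words, and their probabilities are made-up stand-ins for the slide's text mining and health topics:

```python
import random

# Two hypothetical unigram distributions, each signaling a topic.
# All words and probabilities are illustrative; each sums to 1.
text_mining = {"text": 0.2, "mining": 0.15, "association": 0.1,
               "clustering": 0.1, "the": 0.3, "food": 0.05,
               "healthy": 0.05, "is": 0.05}
health = {"food": 0.25, "nutrition": 0.2, "healthy": 0.15,
          "diet": 0.1, "the": 0.2, "text": 0.05, "mining": 0.05}

def sample_text(model, length):
    """Generate text by drawing each word independently
    according to the word distribution."""
    words = list(model)
    weights = list(model.values())
    return " ".join(random.choices(words, weights=weights, k=length))

print(sample_text(text_mining, 10))  # tends to contain "text", "mining", ...
print(sample_text(health, 10))       # tends to contain "food", "healthy", ...
```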
Now let's look at the estimation problem. In this case, we're going to assume that we have observed the data, so we know exactly what the text data looks like. In this case, let's assume we have a text mining paper; in fact, it's the abstract of the paper, so the total number of words is 100. And I've shown some counts of individual words here.
Now, if we ask the question,
what is the most likely
Language Model that has been
used to generate this text data?
Assuming that the text is observed
from some Language Model,
what's our best guess
of this Language Model?
Okay, so the problem now is just to estimate the probabilities of these words, as I've shown here.
So what do you think? What would be your guess? Would you guess "text" has a very small probability, or a relatively large probability? What about "query"? Well, your guess would probably depend on how many times we have observed this word in the text data, right?
And if you think about it for a moment, and if you are like many others, you would have guessed that, well, "text" has a probability of 10 out of 100, because I've observed "text" 10 times in the text that has a total of 100 words. And similarly, "mining" has 5 out of 100. And "query" has a relatively small probability; it's observed just once, so it's 1 out of 100. Right, so that, intuitively, is a reasonable guess.
But the question is, is this our best guess, or best estimate, of the parameters? Of course, in order to answer this question, we have to define what we mean by "best." In this case, it turns out that our guesses are indeed the best in some sense, and this is called the Maximum Likelihood Estimate. It's the best in the sense that it gives the observed data the maximum probability, meaning that if you change the estimate somehow, even slightly, then the probability of the observed text data will be somewhat smaller. And this is called a Maximum Likelihood Estimate.
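A minimal Python sketch of this estimate, using the counts from the example; the "<other>" entry lumps together the remaining words, and the perturbation check at the end is my own illustration of the point that any change to the estimate lowers the probability of the observed data:

```python
import math

# Counts from the example: a 100-word abstract where "text" occurs
# 10 times, "mining" 5 times, and "query" once; "<other>" lumps
# together the remaining 84 words for this illustration.
counts = {"text": 10, "mining": 5, "query": 1, "<other>": 84}
total = sum(counts.values())  # 100

# Maximum likelihood estimate: p(w) = count(w) / total number of words.
mle = {w: c / total for w, c in counts.items()}
print(mle["text"], mle["mining"], mle["query"])  # 0.1 0.05 0.01

def log_likelihood(model, counts):
    """Log probability of the observed text under a unigram model."""
    return sum(c * math.log(model[w]) for w, c in counts.items())

# Shifting even a little probability mass away from the MLE
# makes the observed data less likely:
perturbed = dict(mle)
perturbed["text"] -= 0.01
perturbed["query"] += 0.01  # distribution still sums to 1
print(log_likelihood(mle, counts) > log_likelihood(perturbed, counts))  # True
```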
[MUSIC]

